In Part 1 of this workshop, we attempted to predict the value of a
single, continuous outcome variable hourly_wage based on a
range of input features, including age, location, and industry.
Specifically, we covered:
Train-test splits: partitioning the data set into a training set used to train our model, and a test set used to evaluate the model’s performance.
The bias-variance tradeoff: the balance between the model’s ability to capture the true underlying patterns of the training data (bias) and its flexibility to learn from specific instances within that data (variance). This tradeoff matters because it directly affects the generalization ability of the model to new, unseen data.
Pre-processing: cleaning and transforming raw data into a suitable format for analysis, ensuring the data is free of errors, inconsistencies, and irrelevant information. It often includes coding categorical variables and normalizing numerical values to reduce bias and improve model accuracy.
Today, we will apply and expound upon these principles to develop models to predict categorical variables. This process is called ‘classification.’
Before we dive into today’s material, let’s load the
tidymodels and tidyverse libraries.
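The loading code itself is not shown above, but a minimal version looks like this:

```r
# Load the modeling and data-wrangling suites used throughout the workshop
# (install first with install.packages("tidymodels") etc. if needed)
library(tidymodels)
library(tidyverse)
```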
Thus far, we’ve spent most of our time working through regression problems (i.e., predicting a continuous outcome variable). Let’s switch to a new task: classification. In classification, we aim to predict one of a group of values. For example, predicting a qualitative response is considered classification because we assign the observation to a category (or class).
Like regression, classification is also a supervised learning technique because we have a set of labeled training data that we can use to build a classifier. Now, however, we have access to different techniques because the structure of our outcome variable is categorical.
Identify which of the following are classification problems in machine learning.
(1) An advertiser is interested in the relationship between age and the number of hours of YouTube consumed.
(2) A medical testing company conducts a procedure to determine whether a person has a cancer diagnosis.
(3) A researcher is interested in the effect of an education intervention on students’ test scores.
(4) A software engineer is designing an algorithm to detect whether an email is spam or not.
(5) A political scientist wants to classify Twitter posts as positive or negative.
Solution 1: (2), (4), and (5).
Now, let’s load our primary data set for today’s workshop:
vote2020. Our goal is going to be predicting whether
someone voted in the 2020 election, perhaps to tailor engagement
strategies toward those least likely to vote in an upcoming
election.
vote2020 <- read.csv("../data/vote2020.csv",row.names = NULL)
# visually inspect the data frame
summary(vote2020)
age state sex race
Min. :18.00 Length:69284 Length:69284 Length:69284
1st Qu.:35.00 Class :character Class :character Class :character
Median :51.00 Mode :character Mode :character Mode :character
Mean :50.36
3rd Qu.:65.00
Max. :85.00
marital_status veteran_status citizenship_status hispanic
Length:69284 Min. :0.00000 Length:69284 Min. :0.00000
Class :character 1st Qu.:0.00000 Class :character 1st Qu.:0.00000
Mode :character Median :0.00000 Mode :character Median :0.00000
Mean :0.08448 Mean :0.09562
3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.00000
education disability residence_duration worker_status
Length:69284 Min. :0.0000 Length:69284 Length:69284
"print" 1st Qu.:0.0000 Class :character Class :character
Mode :character Median :0.0000 Mode :character Mode :character
Mean :0.1378
3rd Qu.:0.0000
Max. :1.0000
occupation voted
Length:69284 Min. :0.0000
Class :character 1st Qu.:1.0000
Mode :character Median :1.0000
Mean :0.7963
3rd Qu.:1.0000
Max. :1.0000
Recall that the first step in machine learning (and most analyses) is to explore the data we’re working with to get a sense of its shape and anticipate problems that may arise.
Perform exploratory analyses on the vote2020 data set,
keeping in mind that today’s goal is to predict voted. What
do you notice about the data?
# create bar chart of voting percentages by state
vote2020 %>%
group_by(state) %>%
summarise(voted_percentage = mean(voted, na.rm = TRUE) * 100) %>%
ungroup() %>%
ggplot(aes(x = reorder(state, -voted_percentage), y = voted_percentage)) +
geom_bar(stat = "identity", fill = "blue") +
coord_flip() +
scale_y_continuous(limits = c(0, 100)) +
labs(title = "Percentage of People Who Voted in Each State",
x = "State",
y = "Percentage Voted") +
theme_minimal() + # apply the theme first so it doesn't override later tweaks
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 1)) # adjust x-axis text
Recall from Part 1 that a critical first step in machine learning is partitioning our data into training and test sets.
Using the entire vote2020 data set for model training
resulted in impressively high accuracy during evaluation. However, upon
deploying this model to predict voter turnout for an upcoming election,
the predictions significantly diverged from actual turnout, with many
discrepancies in who was predicted to vote versus who actually voted.
The model’s near-perfect performance in development starkly contrasted
its poor real-world prediction outcomes, indicating a potential
oversight in the model training and evaluation process.
Answer 1: The key issue is that the model was never tested on unseen data. By using the entire data set for training, the model essentially “memorized” the data, including any noise or patterns specific to that set of individuals. This led to overfitting, where the model excelled at predicting the training data but failed to generalize to new, unseen data. Since the model’s performance was only evaluated on data it had already seen, its apparent accuracy was misleading, giving a false sense of confidence in its predictive capabilities.
Now that we have a better sense of the data, let’s go ahead and split
vote2020 into training and test sets:
# Perform splits
vote2020$voted <- as.factor(vote2020$voted)
vote_split <- initial_split(vote2020, prop = 0.75)
vote_train <- training(vote_split)
vote_test <- testing(vote_split)
In our example from Part 1, we prepared our data for analysis by
recoding categorical variables and normalizing numeric ones using
step_dummy(all_nominal_predictors()) and
step_normalize, respectively.
In that example, we also dropped rows that had any missing values across variables. Let’s try another example, in which we don’t omit samples that have missing values, but instead perform imputation, in which we replace those missing values according to certain criteria. There are various kinds of imputation, including:
For example, whenever we have a missing value for
occupation, we can replace it with the most
common occupation. This is called mode imputation.
Or, we could replace a missing numerical predictor (e.g., age) using the median across all the samples. This is called median imputation.
There are other ways to impute, but these are good starting points.
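To make the idea concrete, here is a minimal base-R sketch of what median and mode imputation do, using toy vectors rather than our data set (the recipe steps below handle this for us automatically):

```r
# Toy vectors with missing values
age <- c(34, NA, 51, 28, NA, 60)
occupation <- c("clerk", "nurse", NA, "clerk", "clerk", NA)

# Median imputation for a numeric variable
age[is.na(age)] <- median(age, na.rm = TRUE)

# Mode imputation for a categorical variable: fill with the most common value
mode_val <- names(which.max(table(occupation)))
occupation[is.na(occupation)] <- mode_val

age        # 34.0 42.5 51.0 28.0 42.5 60.0
occupation # "clerk" fills each missing value
```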
The way to perform imputation in a recipe is via the
step_impute_* functions.
Using the same logic from Part 1, create your own recipe called
voterecipe that lays out the steps for pre-processing the
data. Instead of dropping rows with missing values, use
step_impute_median to impute numeric variables and
step_impute_mode to impute categorical variables.
voterecipe <-
recipe(voted ~ ., data = vote_train) %>%
step_impute_median(all_numeric_predictors()) %>%
step_impute_mode(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
This pre-processing allows us to take advantage of samples with missing data, even if it comes at a small cost to accuracy. Imputation is often a necessary step, since missing data is common.
Notice that we have only applied our recipe to the training data so far. Explain why it is important to perform imputation separately on the training and test sets rather than imputing missing values before splitting the data set. In your answer, consider the concepts of model evaluation and the application of the model to unseen data.
Answer 2: We want to keep our training and test sets separate. If we impute missing values in the entire data set using the mean, the mean calculation will include information from both the training and test sets. This means the model gets indirectly exposed to information from the test set during training, which can lead to overly optimistic performance estimates and a model that may not perform as well on truly unseen data. This concept is referred to as ‘data leakage.’
Now that we’ve performed exploratory analyses, a train-test split, and pre-processing, let’s go ahead and train our model. We will create a classifier (i.e., a model that predicts membership in a group) with two methods: logistic regression and random forest.
Machine learning practitioners often recommend logistic regression as a starting model when predicting a binary outcome or probability. For example, if we are estimating the relationship between mortality and income, we can estimate the probability of mortality given a change in income.
We write this as \(P[M|I]\), and the values will range between 0 and 1. We can then make a mortality prediction for any given value of income. Normally, we establish a threshold for prediction. For example, we might predict death where \(P[M|I] > 0.5\).
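As a toy sketch of this thresholding step (hypothetical probabilities, not estimates from a fitted model):

```r
# Hypothetical predicted probabilities for four observations
probs <- c(0.12, 0.48, 0.51, 0.97)

# Apply a 0.5 threshold to convert probabilities to class predictions
pred_class <- ifelse(probs > 0.5, 1, 0)
pred_class # 0 0 1 1
```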
Logistic regression is a generalized linear model in which we model the probability function using the logistic function. Define the probability of interest as \(p(Y = 1|X)\). Then the model is:
\[p(Y = 1|\textbf{X}) = \frac{e^{\textbf{X}\beta}}{1 + e^{\textbf{X}\beta}}\] Here the bold X indicates a vector of features and \(\beta\) a vector of coefficients.
Here’s how to fit classification problems in tidymodels
using logistic regression.
In tidymodels, creating a logistic regression follows
the exact same procedure as a linear regression. This time, however, we
will use logistic_reg() to initiate the function. Let’s
create the model:
# Create model
logistic_model <- logistic_reg(mode = "classification")
Now, let’s fit the model on the training data:
vote_wflow <- workflow() %>%
add_recipe(voterecipe) %>%
add_model(logistic_model)
vote_fit <- fit(vote_wflow, vote_train)
vote_fit %>% tidy()
Finally, let’s use the augment() function to
obtain the predictions, and take a look at them:
logistic_predictions <- augment(vote_fit, new_data = vote_test)
logistic_predictions[,14:17] # look at last 4 columns
Notice that three new columns have appeared in our data set. What do these values mean? What are they telling us?
Answer 3: .pred_class corresponds to
the value our algorithm is predicting, i.e., whether the person voted in
the election or not. This was determined using the two variables
.pred_0 and .pred_1, which represent the
predicted class probabilities from the logistic regression. If the
predicted probability of membership in group 1
(i.e., voted = 1) is greater than 0.5, then .pred_class will
take a value of 1.
To evaluate the model, we’ll use the accuracy function
again:
accuracy(logistic_predictions, truth = voted, estimate = .pred_class)
We predicted the likelihood that someone voted in the last election with an accuracy of about 80% - not bad!
💡 Tip: While we used logistic regression to predict
a binary outcome voted, recall that this prediction is
based on an underlying probability between 0 and 1. These predicted
probabilities can also be useful inputs to causal inference methods,
such as propensity score matching, that require precise likelihood
estimates.
We can also create what’s called a ‘confusion matrix’ to see how well our model did at predicting each class. A confusion matrix is helpful if you care about false positives and false negatives.
logistic_predictions %>%
conf_mat(truth = voted, estimate = .pred_class)
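To make the connection to accuracy concrete, the common classification metrics can all be computed by hand from the four cells of a confusion matrix. A sketch with made-up counts (not the output of the model above):

```r
# Hypothetical confusion-matrix cell counts
tp <- 120  # true positives
tn <- 40   # true negatives
fp <- 25   # false positives
fn <- 15   # false negatives

accuracy_manual    <- (tp + tn) / (tp + tn + fp + fn)  # 0.8
sensitivity_manual <- tp / (tp + fn)  # share of actual positives caught
specificity_manual <- tn / (tn + fp)  # share of actual negatives caught
```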
But, there are many other options in our machine learning arsenal to get a better prediction.
Notice that until now, we have used the default settings for the
linear and logistic regressions we have run. Looking at the
documentation for logistic regression, ?logistic_reg(), we
see that there are many arguments that we left blank when we initialized
our model above. These arguments, also known as ‘hyperparameters’, correspond
to statistical choices about how the model should operate or be
structured.
Some of these hyperparameters include ‘engine’, ‘penalty’, and ‘mixture’:
Engine: Different engines can implement regression through various algorithms or computational approaches. When you specify an engine for logistic regression in R, you’re choosing the particular set of algorithms and optimizations that will be used to train your model.
Penalty: “penalty” refers to a regularization technique used to prevent overfitting by discouraging overly complex models. It does this by adding a penalty to the loss function for large coefficients. Common penalties are L1 (Lasso), which can shrink some coefficients to zero (thus performing feature selection), and L2 (Ridge), which shrinks all coefficients toward zero but typically doesn’t set any to exactly zero. The penalty helps in creating simpler, more generalizable models that perform better on unseen data by prioritizing the most influential features and reducing the model’s sensitivity to the training data’s noise.
Mixture: “mixture” controls the balance between the L1 and L2 penalties in an elastic net. A mixture of 1 corresponds to a pure lasso (L1) model, a mixture of 0 to a pure ridge (L2) model, and values in between blend the two.
For example, we can add a ‘penalty’ on the size of the coefficients of a model and reduce the likelihood of overfitting. Lasso, ridge, and elastic net are different types of penalties that greatly reduce or shrink to zero coefficients on variables that are picking up on a lot of noise. The broad technique of reducing overfitting is called ‘regularization.’
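As a sketch of the underlying objective (written here in glmnet’s elastic-net form), the penalty adds a term to the negative log-likelihood \(-\ell(\beta)\):
\[\min_{\beta} \; -\ell(\beta) + \lambda\left[\frac{(1-\alpha)}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]\]
Here \(\lambda\) is the penalty and \(\alpha\) is the mixture: \(\alpha = 1\) gives the lasso, \(\alpha = 0\) gives ridge, and intermediate values blend the two.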
We have reproduced the original code from above that trained a basic classifier on our voting data set without changing any of the hyperparameters, as well as the code that obtains predictions. Re-run this code several times, but each time change the hyperparameters in the model specification. For mixture, select values between 0 and 1 inclusive; for penalty, include a non-negative number; and for engine, select either “glmnet” or “glm”. How does the accuracy change?
chal3_model <- logistic_reg(mode = "classification",
engine="glmnet",
penalty=1,
mixture=.5)
chal3_wflow <- workflow() %>%
add_recipe(voterecipe) %>%
add_model(chal3_model)
chal3_fit <- fit(chal3_wflow, vote_train)
chal3_predictions <- augment(chal3_fit, new_data = vote_test)
accuracy(chal3_predictions, truth = voted, estimate = .pred_class)
It’s nice that we can do different types of regularization, but how do we know what value of the mixture coefficient to pick? In machine learning, this value - which we choose before fitting the model - is known as a hyperparameter. Since hyperparameters are chosen before we fit the model, we can’t just choose them based off the training data. So, how should we go about conducting hyperparameter tuning: identifying the best hyperparameter(s) to use?
Let’s think back to our original goal. We want a model that generalizes to unseen data. So, ideally, the choice of the hyperparameter should be such that the performance on unseen data is the best. We can’t use the test set for this, but what if we had another set of held-out data?
Cue hyperparameter tuning! Hyperparameter tuning is crucial in machine learning as it directly impacts the performance and effectiveness of models. By fine-tuning hyperparameters, practitioners can optimize models to achieve higher accuracy, better generalize to unseen data, and prevent issues like overfitting or underfitting. This process allows for the customization of models to specific data sets and objectives, enabling the discovery of the best configuration for a given problem.
This is the basis for a validation set. If we had extra held-out data set, we could try a bunch of hyperparameters on the training set, and see which one results in a model that performs the best on the validation set. We then would choose that hyperparameter, and use it to refit the model on both the training data and validation data. We could then, finally, evaluate on the test set.
We just formulated the process of choosing a hyperparameter with a single validation set. However, there are many ways to perform validation. The most common way is cross-validation. Cross-validation is motivated by the concern that we may not choose the best hyperparameter if we’re only validating on a small fraction of the data. If the validation sample, just by chance, contains specific data samples, we may bias our model in favor of those samples, and limit its generalizability.
So, during cross-validation, we effectively validate on the entire training set by breaking it up into folds. Here’s the process: we split the training data into K folds; for each candidate hyperparameter value, we train on K-1 folds and evaluate on the held-out fold, rotating through all K folds; we then average performance across the folds and choose the hyperparameter with the best average.
To implement this, we need to do two things: specify the grid of hyperparameter values to try, and create the folds.
The tidymodels suite has two packages to help us with
these steps: tune and rsample.
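Before turning to those packages, the folding process itself can be sketched manually in base R with a simple linear model on toy data (tune_grid() will automate all of this for us):

```r
# Manual 5-fold cross-validation on toy data
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)

k <- 5
fold_id <- sample(rep(1:k, length.out = n))  # assign each row to a fold

fold_rmse <- sapply(1:k, function(f) {
  train <- data.frame(x = x[fold_id != f], y = y[fold_id != f])
  test  <- data.frame(x = x[fold_id == f], y = y[fold_id == f])
  fit  <- lm(y ~ x, data = train)          # fit on the other K-1 folds
  pred <- predict(fit, newdata = test)     # predict the held-out fold
  sqrt(mean((test$y - pred)^2))            # RMSE on the held-out fold
})

mean(fold_rmse)  # average performance across the K folds
```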
Let’s illustrate both these packages in the classification example. We already have a recipe set up:
voterecipe
── Recipe ───────────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 13
── Operations
• Median imputation for: all_numeric_predictors()
• Mode imputation for: all_nominal_predictors()
• Dummy variables from: all_nominal_predictors()
When specifying the model, however, we’re going to do something slightly different:
tuned_logistic_model <- logistic_reg(
mixture = tune(),
penalty = tune(),
engine = "glmnet")
We passed in a function called tune(). This signals to
tidymodels that we’d like to tune this hyperparameter. How
do we indicate what values we should test during tuning? There is a
package called dials which allows a variety of ways to
customize this. We’re going to keep things simple and focus on the most
basic choice of tuning: a grid search. In this case, we specify a range
of values, and we’ll test every single one for the hyperparameter. We
use the grid_regular function for this procedure:
# Create grid of parameters
cv_grid <- grid_regular(
mixture(range = c(0, 1)),
# penalty() is a special dials function specified on the
# log10 scale, so this range covers 1e-5 to 1e5
penalty(range = c(-5, 5)),
levels=10)
print(cv_grid)
Next, we need to specify how we will perform cross-validation. From
the rsample package, we can use the function
vfold_cv to create the training folds. In this case,
the argument v corresponds to the number of folds (the “K” in K-fold cross-validation).
vote_folds <- vfold_cv(vote_train, v = 5)
vote_folds
# 5-fold cross-validation
We have a tuned model, a grid of hyperparameters, and a set of folds.
We create our workflow as before. But, to train the workflow, we use the
tune_grid function. All the pieces we’ve created are passed
into this function:
# Create workflow
vote_wflow <- workflow() %>%
add_recipe(voterecipe) %>%
add_model(tuned_logistic_model)
# Tune and fit models with tune_grid()
vote_cv_fit <- tune_grid(
# The workflow
vote_wflow,
# The folds we created
resamples = vote_folds,
# The grid of hyperparameters
grid = cv_grid)
There are some nice plotting functions we can use to visualize how
the performance varies as a function of the regularization. For example,
check out the autoplot() function:
autoplot(vote_cv_fit)
What does this tell us about how much regularization we should use?
Of course, we can automate this procedure. The
select_best function will do this for us:
# Select the best hyperparameters according to accuracy
vote_cv_best <- select_best(vote_cv_fit, metric = "accuracy")
vote_cv_best
We can see a penalty column, where it appears to have chosen the smallest penalty. What do we do at this point?
Recall that, during cross-validation, we split up the data and examine performance across many folds. Now that we know what is likely the best penalty, we can re-train on the entire training set.
We do this with the finalize_workflow function.
# Get our final model and finalize workflow
cv_final <- vote_wflow %>%
finalize_workflow(parameters = vote_cv_best) %>%
fit(data = vote_train)
cv_final %>% tidy()
And lastly, we’ll examine the performance on the test set:
cv_predictions <- augment(cv_final, new_data = vote_test)
accuracy(cv_predictions, truth = voted, estimate = .pred_class)
If we compare this to the model we fit earlier, where we guessed at the best value for the penalty hyperparameter, we see that cross-validation lets us fit better models than a single arbitrary choice would. This is reflected in the higher accuracy.
Congratulations, you’ve made it! We covered the basics of supervised
machine learning in tidymodels in this workshop. However,
there’s much more to explore. The best way to keep pushing forward is to
choose a problem to study, and refer to the documentation when you need
help. The website Kaggle has an abundance of good data science problems
to work on if you need help choosing a task!